Finance vs Bird Time Series Classification Models

Author

Jason Abi Chebli

Published

April 12, 2024

How well can I build a simple classifier?

a)

The data used in this analysis contains 974 time series, each described by five predictor variables - linearity, entropy, x_acf1, covariate1 and covariate2 - that are meant to distinguish financial time series from audio tracks of birds. The true classification is stored in the type variable, which labels each series as either “birdsongs” or “finance”.

A summary of the mean and standard deviation of the two types of time series for the different predictor variables can be seen in Table 2.

Table 2: Summary Statistics for Birdsong and Finance Time Series

variable mean birdsong sd birdsong mean finance sd finance
linearity -0.047 0.785 15.385 24.845
entropy 0.838 0.226 0.526 0.408
x_acf1 0.204 0.598 0.492 0.686
covariate1 3.002 0.510 3.002 0.490
covariate2 1.021 0.989 2.469 1.007
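A group-wise summary like Table 2 could be produced along these lines (a sketch, assuming the data frame is named financebirds with a type column, as in the split code later in this document):

```r
library(dplyr)
library(tidyr)

# Sketch: mean and sd of each predictor, split by type
# (assumes a data frame `financebirds` with `type` and the five predictors)
financebirds |>
  pivot_longer(c(linearity, entropy, x_acf1, covariate1, covariate2),
               names_to = "variable", values_to = "value") |>
  group_by(variable, type) |>
  summarise(mean = mean(value), sd = sd(value), .groups = "drop") |>
  pivot_wider(names_from = type, values_from = c(mean, sd))
```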

As can be seen in Table 2, with regard to:

  • linearity: finance and birdsongs have extremely different means, with financial time series having a much higher mean and variance, making linearity a very strong feature for distinguishing between the types.
  • entropy: birdsongs have a higher mean and lower variance than financial time series, resulting in more consistent entropy. Overall, there is a clear difference between the two types’ means and standard deviations, indicating that entropy may be useful.
  • x_acf1: finance and birdsongs have a very strong difference in mean but similar variance, making x_acf1 a useful feature. (Financial time series have higher autocorrelation on average, which makes sense intuitively.)
  • covariate1: almost identical means and standard deviations between finance and birdsongs, making covariate1 not a useful feature for distinguishing between the types.
  • covariate2: finance and birdsongs have a strong difference in mean and similar variance, meaning covariate2 may be quite informative.

Overall, from this quick numerical analysis, it looks like linearity, entropy, x_acf1 and covariate2 may be key features in distinguishing financial time series from audio tracks of birds, while covariate1 does not seem useful for this purpose.

Figure 1’s density plots help visualise the insights that can be seen in Table 2.

Figure 1: Density Plots by Feature and Type

Figure 1 illustrates:

  • covariate1: birdsongs and finance both appear normally distributed with similar mean and variance.
  • covariate2: birdsongs and finance both appear normally distributed with similar variance but different means.
  • entropy: finance has a bimodal distribution while birdsongs has a roughly normal distribution, with different variances.
  • linearity: birdsongs and finance both seem normally distributed; however, their means and variances are extremely different.
  • x_acf1: birdsongs and finance both have bimodal distributions with slightly different variances and means.

Overall, the two types of time series are mainly distinguishable through linearity, entropy, x_acf1 and covariate2, and barely distinguishable by covariate1. This visual conclusion aligns with the numerical conclusion.

The assumptions for linear discriminant analysis (LDA) include:

  • the predictors follow a multivariate normal distribution within each class
  • both classes share the same variance-covariance matrix

It is clear from Figure 1 that neither of these conditions is met:

  • the distributions of some variables (such as x_acf1 and entropy) are bimodal rather than normal
  • there are large variance differences between the two classes, especially in linearity.

Consequently, the LDA assumptions do not hold.

b)

The data was broken into a training set and a testing set as seen in the code below. The split was 70/30 - 70% of the data went to training and 30% to testing - and stratification ensured that the proportions of the birdsongs and finance types are near-equal in both sets.

# Break the data into training and test samples, appropriately (70/30 split).
financebirds_strata <- financebirds |> initial_split(prop = 0.7, strata = type)
financebirds_train <- training(financebirds_strata)
financebirds_test <- testing(financebirds_strata)
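A quick sanity check on the stratification could look like this (a sketch, reusing the split objects above):

```r
# Verify the class proportions are near-equal across the two splits
financebirds_train |> count(type) |> mutate(prop = n / sum(n))
financebirds_test  |> count(type) |> mutate(prop = n / sum(n))
```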
c)

Even though the assumptions do not hold, an LDA and logistic regression model was fitted to the training data. Please see the .qmd file if you would like to see the code.
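One possible way the two models could be specified with tidymodels is sketched below (not necessarily the exact code in the .qmd; discrim_linear() comes from the discrim package):

```r
library(tidymodels)
library(discrim)  # provides discrim_linear()

# LDA fit (MASS engine), despite the violated assumptions
lda_fit <- discrim_linear() |>
  set_engine("MASS") |>
  fit(type ~ linearity + entropy + x_acf1 + covariate1 + covariate2,
      data = financebirds_train)

# Logistic regression fit (glm engine)
logistic_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(type ~ linearity + entropy + x_acf1 + covariate1 + covariate2,
      data = financebirds_train)
```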

d)

Variable Importance

Regarding the LDA model fit, a summary of the linear discriminant of the different features can be seen in Table 3.

Table 3: LDA Fit Linear Discriminants

Feature LD1
linearity 0.020
entropy -1.079
x_acf1 -0.136
covariate1 0.041
covariate2 0.810

According to Table 3, for the LDA model, entropy is the most influential feature, followed closely by covariate2 as the second most influential. Meanwhile, the other features contribute much less and may not improve classification much - especially x_acf1 and linearity.

Regarding the logistic regression model fit, a summary of the logistic regression coefficients can be seen in Table 4.

Table 4: Logistic Regression Fit Coefficients

term estimate std.error statistic p.value
(Intercept) -1.513 0.767 -1.973 0.048
linearity 0.048 0.009 5.277 0.000
entropy -2.047 0.433 -4.732 0.000
x_acf1 -0.277 0.207 -1.341 0.180
covariate1 0.142 0.216 0.661 0.509
covariate2 1.416 0.124 11.408 0.000

We can see from Table 4 that, for the Logistic Regression model, x_acf1 and covariate1 have high p-values (p > 0.05) and are therefore not statistically significant, meaning there is insufficient evidence that they are useful predictors. Meanwhile, linearity, entropy and covariate2 have low p-values and are statistically significant - strong evidence that they are useful predictors. Looking at the coefficient estimates, entropy is strongly negative, while covariate2 is strongly positive and linearity is also positive.
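Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios. Using the Table 4 estimates (and assuming finance is the modelled event level), a one-unit increase in covariate2 multiplies the odds of finance by roughly 4.1:

```r
# Odds ratios from the Table 4 coefficient estimates
coefs <- c(linearity = 0.048, entropy = -2.047, covariate2 = 1.416)
round(exp(coefs), 3)
#>  linearity    entropy covariate2
#>      1.049      0.129      4.121
```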

Confusion Matrices and Accuracy

The confusion matrix for the LDA model and the Logistic Regression model on the test set can be seen in Table 5 and Table 6, respectively.

Table 5: LDA Model Confusion Matrix

type birdsongs finance cl_acc
birdsongs 101 42 0.706
finance 21 129 0.860

Table 6: Logistic Regression Model Confusion Matrix

type birdsongs finance cl_acc
birdsongs 124 19 0.867
finance 32 118 0.787

As can be seen from the confusion matrices in Tables 5 and 6, for the test set, the LDA model is better at predicting financial time series than the Logistic Regression model - correctly predicting financial time series 86.0% of the time compared to the Logistic Regression model’s 78.7%. Meanwhile, the Logistic Regression model is better at predicting audio tracks of birds: it correctly predicts birdsongs 86.7% of the time, while the LDA model correctly predicts them only 70.6% of the time on the test set.

From the confusion matrices, the accuracy and balanced accuracy for the two different models can be determined.

For the LDA model,

\[accuracy = 78.5\%\] \[balanced \ accuracy = 78.31\%\]

For the Logistic Regression,

\[accuracy = 82.59\%\] \[balanced \ accuracy = 82.69\%\]
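These figures follow directly from the confusion-matrix counts in Tables 5 and 6, as the short base-R check below shows (accuracy is total correct over total; balanced accuracy is the mean of the per-class accuracies):

```r
# Recomputing accuracy and balanced accuracy from the Table 5 and 6 counts
lda_cm      <- matrix(c(101, 42, 21, 129), nrow = 2, byrow = TRUE)
logistic_cm <- matrix(c(124, 19, 32, 118), nrow = 2, byrow = TRUE)

accuracy     <- function(cm) sum(diag(cm)) / sum(cm)
balanced_acc <- function(cm) mean(diag(cm) / rowSums(cm))

round(100 * c(accuracy(lda_cm), balanced_acc(lda_cm)), 2)
#> [1] 78.50 78.31
round(100 * c(accuracy(logistic_cm), balanced_acc(logistic_cm)), 2)
#> [1] 82.59 82.69
```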

As can be seen, the Logistic Regression model has a higher accuracy and balanced accuracy, while the LDA model’s are slightly lower. If we were to use only this metric, it indicates that the Logistic Regression model is the more appropriate of the two. However, this is not the best way to choose a model, as these measures apply to a single cut-off point (0.5 by default). The Area Under the Curve (AUC) of a Receiver Operating Characteristic (ROC) curve evaluates performance across every cut-off point, so conclusions drawn from it better reflect which model produces the most accurate probabilities. More on the ROC curves for these two models is in Section 2e).

Figure 2 and Figure 3 illustrate the different models confidence in predicting the correct class.

Figure 2: LDA Model Confidence in Correct Class

Figure 3: Logistic Regression Model Confidence in Correct Class

As can be seen in Figure 2, the LDA model is more confident when classifying financial time series than audio tracks of birds, with a higher median confidence. Meanwhile, the Logistic Regression model is highly confident at classifying both types, though it shows a few surprisingly low confidence values even when its prediction is correct (as seen in Figure 3).

Mistakes

In total, 74 unique mistakes were made across the LDA model and the Logistic Regression model. A glimpse at the first few mistakes can be seen in Table 7.

Table 7: Glimpse at the Mistakes made by the LDA and Logistic model

lda_pred logistic_pred type same_mistake
finance birdsongs birdsongs FALSE
finance birdsongs birdsongs FALSE
finance finance birdsongs TRUE
finance birdsongs birdsongs FALSE
finance birdsongs birdsongs FALSE
finance finance birdsongs TRUE

Breaking this down further, Table 8 outlines a summary of the mistakes made by the two linear models.

Table 8: Summary of the Mistakes Made by LDA and Logistic Regression Model

lda_total_mistakes logistic_total_mistakes same_mistake_count
63 51 40

As can be seen in Table 8, there are 40 mistakes made by both the LDA and Logistic Regression models, meaning that 63.5% of the LDA model’s mistakes and 78.4% of the Logistic Regression model’s mistakes occur on the same observations. That is a very large chunk.

To better understand why this is the case, Figure 4 and Figure 5 help us investigate.

Figure 4: LDA Model: Grand Tour and Confusion Scatterplot

Figure 5: Logistic Regression Model: Grand Tour and Confusion Scatterplot

As can be seen in Figure 4 and Figure 5, the grand tours illustrate that classification can sometimes be confusing due to the overlap between the two time series types. Consequently, the confusion scatterplots for both LDA and Logistic Regression (Figures 4 and 5) show misclassifications for both birdsongs and finance. Interestingly, the number of misclassifications looks roughly similar across both types and both models.

e)

The ROC curves for the LDA model and the Logistic Regression model can be seen in Figure 6.

Figure 6: LDA vs Logistic Regression Model ROC Curves

As can be seen in Figure 6, the ROC curve for the two models are very similar, with the LDA ROC curve looking marginally better. To confirm this, we can check the area under the ROC curve (known as AUC), which has been summarised in Table 9.

Table 9: Area Under the ROC Curve for the LDA and Logistic Regression Model

Model AUC
LDA 0.898
Logistic Regression 0.894

As can be seen in Table 9, the LDA model had a marginally better AUC value, which supports our visual conclusion made about the ROC curve. The ROC tells us how well the model separates the two classes regardless of a decision threshold. Consequently, out of the two, the LDA model is the better model. However, I want to note that as the AUC for both of the models are so similar, the difference between the two models is minimal, meaning that if one wanted a more simple model with similar accuracy, the Logistic Regression model may be suitable.
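The ROC/AUC comparison in Table 9 could be reproduced with yardstick along these lines (a sketch; lda_fit stands in for a fitted model object, and .pred_birdsongs is the probability column for the first factor level):

```r
library(tidymodels)

# Attach class probabilities to the test set, then compute AUC and ROC curve
lda_aug <- augment(lda_fit, financebirds_test)

lda_aug |> roc_auc(truth = type, .pred_birdsongs)
lda_aug |>
  roc_curve(truth = type, .pred_birdsongs) |>
  autoplot()
```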

f)

From the LDA and Logistic Regression models, we can conclude how the time series for financial data and birdsongs typically differ. From both fitted models, we saw in Table 3 and Table 4 that entropy and covariate2 are very significant in distinguishing birdsongs from financial data. Additionally, linearity can also be critical, with contrasting means and variances between the two types as seen in Table 2. However, there are around 40 observations that both models found difficult to predict correctly (as outlined in Table 8), which is the major obstacle to distinguishing the two types of series. Looking at the tour and confusion scatterplots in Figures 4 and 5, we saw that in general the two types are easily distinguishable, except for the points that overlap in some frames of the tour.

Overall, we can see from the fitted model and summary statistics that the time series for financial data and birdsongs typically differ due to entropy, covariate2 and linearity.

Tuning a non-linear classifier

a)

Using the tidymodels style of coding, a “bad” decision tree was fitted to the training data, with min_n = 1 and cost_complexity = 0. This means we are training a tree where the minimum number of samples required to split a node is 1 and there is no penalty for adding extra branches. Consequently, this “bad” decision tree is going to be extremely overfitted. A plot of this extremely overfitted, “bad” decision tree can be seen in Figure 7.
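This fit could be expressed as follows (a sketch, reusing the training split from earlier):

```r
# A deliberately overfitted "bad" tree: splits allowed down to a single
# observation (min_n = 1) and no pruning penalty (cost_complexity = 0)
bad_tree_fit <- decision_tree(cost_complexity = 0, min_n = 1) |>
  set_engine("rpart") |>
  set_mode("classification") |>
  fit(type ~ linearity + entropy + x_acf1 + covariate1 + covariate2,
      data = financebirds_train)
```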

Figure 7: “Bad” Decision Tree Fit

This “bad” decision tree is so complex and overfitted that the plot is extremely hard to read. In fact, there are 73 terminal nodes and 145 branches in this “bad” decision tree. The deep splits and many branches indicate that it did a good job ‘memorising’ patterns in the training data.

In investigating the importance of the variables for this “bad” decision tree, a summary table can be seen in Table 10.

Table 10: Variable Importance for the “Bad” Decision Tree

x
linearity 164.4
covariate2 130.1
entropy 107.2
x_acf1 80.9
covariate1 21.0

As can be seen in Table 10, linearity and covariate2 are extremely significant in distinguishing the type, entropy and x_acf1 are also significant, and covariate1 is the least significant.

Table 11 and Table 12 both give us insights into how well the “bad” decision tree performed on the training data set and the testing data set, respectively.

Table 11: Confusion Matrix on Training Data for “Bad” Decision Tree

type birdsongs finance cl_acc
birdsongs 331 0 1
finance 0 350 1

Table 12: Confusion Matrix on Testing Data for “Bad” Decision Tree

type birdsongs finance cl_acc
birdsongs 127 16 0.888
finance 16 134 0.893

As can be seen in Table 11, the “bad” decision tree ended up perfectly fitting every single observation in the training set - a clear sign of overfitting. Meanwhile, looking at the test set confusion matrix in Table 12, the “bad” decision tree didn’t perform as “bad” as one might expect, correctly predicting birdsongs 88.8% of the time and finance 89.3% of the time. Although the model is overfitted, it actually performed okay.

On the training set:

\[accuracy = 100\%\] \[balanced \ accuracy = 100\%\]

On the test set:

\[accuracy = 89.08\%\] \[balanced \ accuracy = 89.07\%\]

As expected, the tree achieved a perfect 100% accuracy and balanced accuracy on the training set due to overfitting. Surprisingly, it still performed reasonably well on the test set, with an accuracy of 89.08% and a balanced accuracy of 89.07%. This overfitted, “bad” decision tree actually has a higher accuracy and balanced accuracy on the test set than the LDA or Logistic Regression models. (Again, note that for a more rigorous comparison the models’ ROC curves must be compared, as done in Section 3c.)

b)

Using the capabilities in tidymodels, the optimal tree parameters - tree_depth, min_n, cost_complexity - were determined. The code used to determine the optimal tree parameters can be seen below.

# Define the tunable decision tree spec
tune_spec <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) |> 
  set_engine("rpart") |> 
  set_mode("classification")

# Create a grid of parameters to tune
tree_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# Set up cross-validation folds
set.seed(234)
financebirds_folds <- vfold_cv(financebirds_train, v = 5, strata = type)

# Create a workflow
tree_wf <- workflow() |>
  add_model(tune_spec) |>
  add_formula(type ~ linearity + entropy + x_acf1 + covariate1 + covariate2)

# Perform the tuning
set.seed(345)
tree_res <- tree_wf |> 
  tune_grid(
    resamples = financebirds_folds,
    grid = tree_grid,
    metrics = NULL  # Computes a standard set of metrics
  )

# View and summarise results
tree_res |> collect_metrics() |> slice_head(n = 6)

# Find the best combination (based on AUC)
tree_top_5 <- tree_res |> show_best(metric = "roc_auc")
tree_best <- tree_res |> select_best(metric = "roc_auc")

# Finalize workflow and fit on training data
tuned_tree_wf <- tree_wf |> finalize_workflow(tree_best)
tuned_tree_fit <- tuned_tree_wf |> fit(data = financebirds_train)

To determine the most optimal tree parameters, I defined a tune-able decision tree model specification with cost complexity, tree depth, and minimum node size as hyperparameters. I then created a regular grid of parameter combinations (with five levels for each hyperparameter) and implemented five-fold cross-validation. Each model in the grid was trained and evaluated using default classification metrics. After tuning, I collected and analysed the performance metrics and selected the best hyperparameter combination based on ROC AUC.

The top 5 best performing tuned decision tree models, based on ROC AUC, can be seen in Table 13.

Table 13: Top 5 Tuned Decision Tree Models Ranked by ROC AUC Performance

cost_complexity tree_depth min_n .metric .estimator mean n std_err .config
1.00e-10 11 21 roc_auc binary 0.946 5 0.004 Preprocessor1_Model066
1.78e-08 11 21 roc_auc binary 0.946 5 0.004 Preprocessor1_Model067
3.16e-06 11 21 roc_auc binary 0.946 5 0.004 Preprocessor1_Model068
5.62e-04 11 21 roc_auc binary 0.946 5 0.004 Preprocessor1_Model069
1.00e-10 15 21 roc_auc binary 0.946 5 0.004 Preprocessor1_Model071

As can be seen in Table 13, the ROC AUC mean value is similar for the top 5, however, what makes the first model (seen in the first row) superior is the combination of a low cost_complexity and lower tree_depth.

Consequently, the hyperparameters that will lead to the best model can be seen in Table 14.

Table 14: Best Tuned Decision Tree Hyperparameters Model Based on ROC AUC

cost_complexity tree_depth min_n
1e-10 11 21
c)

Using the optimal hyperparameters - min_n = 21, tree_depth = 11 and cost_complexity = 1e-10 - the tuned decision tree was fitted to the training data. This means we are training a tree where the minimum number of samples required to split a node is 21, the maximum depth from the root is 11, and there is a very small penalty for adding extra branches. Consequently, this tuned decision tree should not overfit the data like the “bad” decision tree did. A plot of this tuned decision tree can be seen in Figure 8.

Figure 8: Tuned Decision Tree Fit

As can be seen in Figure 8, the tuned decision tree is somewhat complex, but it is nowhere near as complex as the “bad” decision tree seen in Figure 7. In fact, the tuned decision tree has 15 terminal nodes and 14 branches.

In investigating the importance of the variables for this tuned decision tree, a summary table can be seen in Table 15.

Table 15: Variable Importance for the Tuned Decision Tree

x
linearity 132.69
covariate2 120.70
entropy 81.67
x_acf1 70.17
covariate1 2.77

Exactly as for the “bad” decision tree, we can see from Table 15 that linearity and covariate2 are extremely significant in distinguishing the type for the tuned decision tree, entropy and x_acf1 are also significant, and covariate1 is the least significant.

Table 16 and Table 17 both give us insights into how well the tuned decision tree performed on the training data set and the testing data set, respectively.

Table 16: Confusion Matrix on Training Data for Tuned Decision Tree

type birdsongs finance cl_acc
birdsongs 306 25 0.924
finance 26 324 0.926

Table 17: Confusion Matrix on Testing Data for Tuned Decision Tree

type birdsongs finance cl_acc
birdsongs 126 17 0.881
finance 17 133 0.887

As can be seen in Table 16, the tuned decision tree ended up correctly predicting birdsongs 92.4% of the time and finance 92.6% of the time in the training set - unlike the overfitted model, which achieved 100%. Looking at the test set confusion matrix in Table 17, the tuned decision tree correctly predicted birdsongs 88.1% of the time and finance 88.7% of the time. Overall, this is quite a good result.

On the training set:

\[accuracy = 92.51\%\] \[balanced \ accuracy = 92.51\%\]

On the test set:

\[accuracy = 88.4\%\] \[balanced \ accuracy = 88.39\%\]

As can be seen, the accuracy and balanced accuracy on the training set are very high, and the model continued to perform quite well on the test set, with an accuracy of 88.4% and a balanced accuracy of 88.39%. This tuned decision tree has a higher test-set accuracy and balanced accuracy than the LDA and Logistic Regression models, though slightly lower than the “bad” decision tree’s.

Now in terms of which model is the most accurate and the best choice, Figure 9 illustrates the ROC Curve for the “bad” decision tree and the tuned decision tree.

Figure 9: ROC Curves for Bad vs Tuned Decision Trees

As can be seen in Figure 9, the tuned decision tree has a much better ROC curve than the “bad” decision tree. Additionally, the “bad” decision tree’s ROC curve looks like a piecewise linear function with two segments. Now checking the area under the ROC curve, a summary can be seen in Table 18.

Table 18: Area Under Curve (AUC) for the different decision tree models.

Model AUC
Bad Decision Tree 0.891
Tuned Decision Tree 0.934

Regarding the ROC curve and the AUC, the “bad” decision tree performed worse than the LDA or Logistic Regression models - having a lower AUC. However, the tuned decision tree had a much better ROC curve and AUC value than the “bad” decision tree, LDA and Logistic Regression models. Therefore, in practice I would use the tuned decision tree over any of the other three models, as it is reasonably more accurate. This decision makes sense because, given the data set, a tuned decision tree can outperform the linear models: no single clear linear separation is apparent in the data (see the grand tours in Figures 4 and 5 to visualise this).

Which is the better classifier?

a)

A Random Forest model was developed to distinguish between financial time series and audio tracks of birds. Using the randomForest engine with 1000 trees, I tuned mtry and min_n through a 5-fold stratified cross-validation. I created a regular grid over the ranges mtry = 1 to 5 and min_n = 2 to 20, using 5 levels for each parameter. I selected the best hyperparameters for the model based on ROC AUC and finalised the workflow before fitting it to the full training set.

Note: I also played around with using the Ranger engine, however, randomForest seemed to perform better.
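A sketch of how this specification and grid could look, mirroring the tuned-tree code earlier (the fold and formula objects are assumed to be the same ones defined there):

```r
# Tunable random forest spec: 1000 trees, tuning mtry and min_n
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) |>
  set_engine("randomForest") |>
  set_mode("classification")

# Regular grid over mtry = 1..5 and min_n = 2..20, 5 levels each
rf_grid <- grid_regular(
  mtry(range = c(1, 5)),
  min_n(range = c(2, 20)),
  levels = 5
)

# Tune with the same 5-fold stratified CV, then pick the best by roc_auc
rf_res <- workflow() |>
  add_model(rf_spec) |>
  add_formula(type ~ linearity + entropy + x_acf1 + covariate1 + covariate2) |>
  tune_grid(resamples = financebirds_folds, grid = rf_grid)

rf_best <- select_best(rf_res, metric = "roc_auc")
```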

The top 5 best performing tuned random forest models, based on ROC AUC, can be seen in Table 19.

Table 19: Top 5 Tuned Random Forest Models Ranked by ROC AUC Performance

mtry min_n .metric .estimator mean n std_err .config
5 15 roc_auc binary 0.947 5 0.005 Preprocessor1_Model20
3 15 roc_auc binary 0.947 5 0.006 Preprocessor1_Model18
5 11 roc_auc binary 0.947 5 0.005 Preprocessor1_Model15
4 11 roc_auc binary 0.947 5 0.005 Preprocessor1_Model14
4 15 roc_auc binary 0.947 5 0.006 Preprocessor1_Model19

As can be seen in Table 19, the ROC AUC mean value is similar for the top 5. Interestingly, row 2 shows that an equally high ROC AUC can be achieved with just mtry = 3 and min_n = 15, which suggests that a small subset of the variables does most of the distinguishing (not a surprise, given our previous analysis). Table 20 illustrates the variable importance for the random forest.

Table 20: Variable Importance for the Random Forest

MeanDecreaseGini
linearity 121.43
entropy 18.04
x_acf1 32.69
covariate1 6.46
covariate2 113.47

As can be seen in Table 20, and as with the other models in this analysis, linearity and covariate2 are extremely significant in distinguishing the type for the random forest model, while x_acf1 and entropy are less significant, and covariate1 is nearly negligible. As with the tuned decision tree, the “bad” decision tree, LDA and Logistic Regression models, linearity remains the most important variable for the random forest, with covariate2 close behind.

Overall, the hyperparameters that will lead to the best random forest model can be seen in Table 21.

Table 21: Best Tuned Random Forest Parameters Model Based on ROC AUC

mtry min_n
5 15

Table 22 and Table 23 both give us insights into how well the random forest model performed on the training data set and the testing data set, respectively.

Table 22: Confusion Matrix on Training Data for Random Forest Model

type birdsongs finance cl_acc
birdsongs 314 17 0.949
finance 18 332 0.949

Table 23: Confusion Matrix on Testing Data for Random Forest Model

type birdsongs finance cl_acc
birdsongs 134 9 0.937
finance 19 131 0.873

As can be seen in Table 22, the random forest model ended up correctly predicting birdsongs 94.9% of the time and finance 94.9% of the time in the training set. Looking at the test set confusion matrix in Table 23, the random forest model correctly predicted birdsongs 93.7% of the time and finance 87.3% of the time. Overall, this is quite a good result, and the test set confusion matrix looks better for the random forest model than for the other models so far. Note that throughout, including for the random forest model, the models misclassify finance series as birdsongs more often than vice versa.

On the training set:

\[accuracy = 94.86\%\] \[balanced \ accuracy = 94.86\%\]

On the test set:

\[accuracy = 90.44\%\] \[balanced \ accuracy = 90.52\%\]

As can be seen, regarding the training set, the random forest model’s accuracy and balanced accuracy are reasonably high (higher than the tuned decision tree model on the training set). For the test set, the random forest model also achieves a high accuracy and balanced accuracy. This random forest model has a higher accuracy and balanced accuracy on the test set than the tuned decision tree, “bad” decision tree, LDA or Logistic Regression model.

Although the random forest is doing very well, there are still errors. Even when playing around with the hyperparameters and training a more complex random forest, the results don’t improve much. To better understand why, Table 24 shows the probability the random forest assigned to each type and the vote difference, arranged from smallest to largest vote difference.
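The vote_diff column in Table 24 can be derived directly from the class probabilities (a sketch; rf_preds stands in for the random forest’s test-set predictions with tidymodels-style .pred_* columns):

```r
# vote_diff = absolute gap between the two class probabilities;
# small values flag observations the forest is unsure about
rf_preds |>
  mutate(vote_diff = abs(.pred_birdsongs - .pred_finance)) |>
  arrange(vote_diff)
```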

Table 24: Random Forest Predictions Sorted by Vote Difference

.pred_birdsongs .pred_finance vote_diff predicted_class actual
0.480 0.520 0.040 finance birdsongs
0.521 0.479 0.042 birdsongs birdsongs
0.523 0.477 0.046 birdsongs birdsongs
0.525 0.475 0.050 birdsongs birdsongs
0.475 0.525 0.050 finance finance
0.470 0.530 0.060 finance birdsongs
0.531 0.469 0.062 birdsongs birdsongs
0.546 0.454 0.092 birdsongs birdsongs
0.570 0.430 0.140 birdsongs finance
0.572 0.428 0.144 birdsongs birdsongs
0.578 0.422 0.156 birdsongs birdsongs
0.418 0.582 0.164 finance birdsongs
0.582 0.418 0.164 birdsongs birdsongs
0.582 0.418 0.164 birdsongs birdsongs
0.585 0.415 0.170 birdsongs birdsongs
0.593 0.407 0.186 birdsongs finance
0.594 0.406 0.188 birdsongs finance
0.397 0.603 0.206 finance finance
0.389 0.611 0.222 finance finance
0.615 0.385 0.230 birdsongs birdsongs

As can be seen in Table 24, there are 8 observations that the random forest model struggles to clearly distinguish (vote_diff < 10%). Additionally, within the 20 observations with the smallest vote differences shown, there are 6 incorrect predictions. Given that there are 28 incorrect predictions in total, with only 6 in this small set, it indicates that the variables may not be clearly distinguishable enough to achieve much higher accuracy than what is achieved, which explains the remaining errors.

b)

A boosted tree model was developed to distinguish between financial time series and audio tracks of birds. Using the xgboost engine with 1000 trees, I tuned mtry, min_n and tree_depth through a 5-fold stratified cross-validation. I created a regular grid over the ranges mtry = 1 to 5, min_n = 2 to 20 and tree_depth = 1 to 10, using 5 levels for each parameter. I selected the best hyperparameters for the model based on ROC AUC and finalised the workflow before fitting it to the full training set.

Note: I also played around with tuning other hyperparameters such as sample_size, learn_rate, loss_reduction, however, they increased compute time drastically and the added accuracy was very minimal - making it less ideal. As such, mtry, min_n and tree_depth were the only hyperparameters tuned for this final analysis.
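A sketch of how this specification and grid could look (same hedge as before: the CV folds and formula are assumed to be those defined earlier):

```r
# Tunable boosted tree spec (xgboost engine): 1000 trees,
# tuning mtry, min_n and tree_depth
bt_spec <- boost_tree(mtry = tune(), min_n = tune(),
                      tree_depth = tune(), trees = 1000) |>
  set_engine("xgboost") |>
  set_mode("classification")

# Regular grid over mtry = 1..5, min_n = 2..20, tree_depth = 1..10
bt_grid <- grid_regular(
  mtry(range = c(1, 5)),
  min_n(range = c(2, 20)),
  tree_depth(range = c(1, 10)),
  levels = 5
)
```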

The top 5 best performing tuned boosted tree models, based on ROC AUC, can be seen in Table 25.

Table 25: Top 5 Tuned Boosted Tree Models Ranked by ROC AUC Performance

mtry min_n tree_depth .metric .estimator mean n std_err .config
2 2 1 roc_auc binary 0.946 5 0.006 Preprocessor1_Model002
1 2 1 roc_auc binary 0.945 5 0.005 Preprocessor1_Model001
5 6 1 roc_auc binary 0.945 5 0.005 Preprocessor1_Model010
3 6 1 roc_auc binary 0.945 5 0.005 Preprocessor1_Model008
4 2 1 roc_auc binary 0.945 5 0.005 Preprocessor1_Model004

As can be seen in Table 25, the ROC AUC mean value is similar for the top 5, with the main difference between the hyperparameter combinations being the mtry value. Noticeably, tree_depth = 1 for all top 5, indicating very shallow trees (stumps), while min_n stays small (2 or 6). Finally, for the top model in the first row, mtry = 2, indicating that only 2 predictors are randomly considered at each split.

Table 26 illustrates the contribution metrics for the different variables for the boosted tree model.

Table 26: Feature Contribution Metrics for the Boosted Tree Model

Feature Gain Cover Frequency
covariate2 0.351 0.205 0.203
linearity 0.314 0.166 0.148
x_acf1 0.151 0.213 0.212
entropy 0.148 0.215 0.223
covariate1 0.037 0.201 0.214

As can be seen in Table 26, covariate2 is the most important feature in the boosted tree model, contributing 35.1% of the overall gain, followed by linearity at 31.4%. x_acf1 and entropy contribute moderately (around 15% each), while covariate1 improves the model’s performance only negligibly. Although all variables are used to split fairly frequently, there is a clear pattern in which variables improve the model the most, and it has been fairly consistent throughout this analysis.

Overall, the hyperparameters that will lead to the best boosted tree model can be seen in Table 27.

Table 27: Best Tuned Boosted Tree Parameters Model Based on ROC AUC

mtry min_n tree_depth
2 2 1

Table 28 and Table 29 both give us insights into how well the boosted tree performed on the training data set and the testing data set, respectively.

Table 28: Confusion Matrix on Training Data for Boosted Tree

type birdsongs finance cl_acc
birdsongs 319 12 0.964
finance 15 335 0.957

Table 29: Confusion Matrix on Testing Data for Boosted Tree

type birdsongs finance cl_acc
birdsongs 135 8 0.944
finance 20 130 0.867

As can be seen in Table 28, the boosted tree ended up correctly predicting birdsongs 96.4% of the time and finance 95.7% of the time in the training set - higher than the random forest (potentially indicating it may be starting to overfit as it approaches 100%). Looking at the test set confusion matrix in Table 29, the boosted tree correctly predicted birdsongs 94.4% of the time and finance 86.7% of the time. Overall, this is quite a good result, and the test set confusion matrix looks quite good compared to the other models. Note that throughout, including for the boosted tree model, the models misclassify finance series as birdsongs more often than vice versa.

On the training set:

\[accuracy = 96.04\%\] \[balanced \ accuracy = 96.04\%\]

On the test set:

\[accuracy = 90.44\%\] \[balanced \ accuracy = 90.54\%\]

As can be seen, the accuracy and balanced accuracy on the training set are very high, and the model continued to perform well on the test set, with an accuracy of 90.44% and a balanced accuracy of 90.54%. This boosted tree model has a higher accuracy and balanced accuracy on the test set than the random forest, tuned decision tree, "bad" decision tree, LDA and Logistic Regression models.
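These figures follow directly from the test-set confusion matrix in Table 29. The analysis itself is in R/tidymodels; a quick Python sketch just to show the arithmetic:

```python
# Test-set confusion matrix from Table 29 (outer keys = actual class).
conf = {"birdsongs": {"birdsongs": 135, "finance": 8},
        "finance":   {"birdsongs": 20,  "finance": 130}}

# Overall accuracy: correct predictions over all observations.
total = sum(sum(row.values()) for row in conf.values())
correct = sum(conf[c][c] for c in conf)
accuracy = correct / total            # (135 + 130) / 293

# Balanced accuracy: mean of the per-class accuracies (recalls).
per_class = [conf[c][c] / sum(conf[c].values()) for c in conf]
balanced = sum(per_class) / len(per_class)
```

Balanced accuracy matters here because the per-class accuracies differ (94.4% vs 86.7%); with a skewed class mix, overall accuracy alone would hide the weaker finance performance.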

To better understand why the boosted tree model is incorrectly predicting the type on the test set, Table 30 lists the probability the boosted tree model assigned to each type and the vote difference, arranged from the smallest vote difference to the largest.

Table 30: Boosted Tree Predictions Sorted by Vote Difference

| .pred_birdsongs | .pred_finance | vote_diff | predicted_class | actual    |
|-----------------|---------------|-----------|-----------------|-----------|
| 0.500           | 0.500         | 0.001     | birdsongs       | finance   |
| 0.496           | 0.504         | 0.009     | finance         | birdsongs |
| 0.485           | 0.515         | 0.031     | finance         | finance   |
| 0.528           | 0.472         | 0.056     | birdsongs       | finance   |
| 0.542           | 0.458         | 0.083     | birdsongs       | finance   |
| 0.547           | 0.453         | 0.094     | birdsongs       | birdsongs |
| 0.453           | 0.547         | 0.095     | finance         | finance   |
| 0.443           | 0.557         | 0.115     | finance         | finance   |
| 0.565           | 0.435         | 0.130     | birdsongs       | finance   |
| 0.579           | 0.421         | 0.158     | birdsongs       | birdsongs |
| 0.581           | 0.419         | 0.162     | birdsongs       | finance   |
| 0.398           | 0.602         | 0.204     | finance         | finance   |
| 0.612           | 0.388         | 0.224     | birdsongs       | finance   |
| 0.645           | 0.355         | 0.291     | birdsongs       | birdsongs |
| 0.665           | 0.335         | 0.330     | birdsongs       | birdsongs |
| 0.328           | 0.672         | 0.343     | finance         | finance   |
| 0.328           | 0.672         | 0.343     | finance         | birdsongs |
| 0.675           | 0.325         | 0.351     | birdsongs       | birdsongs |
| 0.677           | 0.323         | 0.354     | birdsongs       | finance   |
| 0.320           | 0.680         | 0.360     | finance         | finance   |

Comparing Table 30 to that of the random forest model (Table 24), we see that there are now 11 observations that the boosted tree struggles to clearly distinguish (vote_diff < 20%) - more than the random forest. Additionally, within the top 20 observations with the smallest vote difference shown, there are 9 incorrect predictions. Given that there are 28 incorrect predictions in total, with 9 of them falling in this small set, the variables may simply not be distinguishable enough to achieve much higher accuracy than what was achieved, which explains why errors remain in the boosted tree model. From Table 30 we can also see that the boosted tree model is much less confident in its predictions than the random forest model, with prediction probabilities sitting closer to 0.5.
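A table like Table 30 can be built directly from the predicted class probabilities. A small Python sketch using a few hypothetical probability pairs (the rounding of vote_diff in the report's table may differ slightly):

```python
# Hypothetical (p_birdsongs, p_finance) pairs, as a model might output.
preds = [(0.496, 0.504), (0.485, 0.515), (0.542, 0.458), (0.398, 0.602)]

rows = []
for p_bird, p_fin in preds:
    rows.append({
        "pred_birdsongs": p_bird,
        "pred_finance": p_fin,
        # Vote difference: absolute gap between the two class probabilities.
        "vote_diff": abs(p_bird - p_fin),
        # Predicted class is simply the higher-probability type.
        "predicted_class": "birdsongs" if p_bird > p_fin else "finance",
    })

# Sort ascending so the least confident predictions come first.
rows.sort(key=lambda r: r["vote_diff"])
```

In the tidymodels workflow this would correspond to arranging the `predict(..., type = "prob")` output by the absolute difference of the two `.pred_*` columns.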

c)

Figure 14 illustrates the ROC Curve for the random forest model and the boosted tree model.

Figure 14: ROC Curves for Random Forest vs Boosted Tree Model

As can be seen in Figure 14, the ROC curves for both models are similar, with the boosted tree curve appearing to perform slightly better. It is good to confirm this by checking the area under each curve - a summary is provided in Table 31.

Table 31: Area Under Curve (AUC) for the Random Forest and Boosted Tree Models

| Model         | AUC   |
|---------------|-------|
| Random Forest | 0.965 |
| Boosted Tree  | 0.971 |

Regarding the ROC Curve and the AUC, both models perform very well, and both outperform the tuned decision tree, "bad" decision tree, LDA and Logistic Regression models. Of the two, the boosted tree model edges out the random forest on both the ROC Curve and the AUC. Consequently, as it has the best ROC Curve and AUC, the boosted tree model is the current best model for distinguishing between financial time series and audio tracks of birds.
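The AUC can also be computed without plotting the curve at all: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann-Whitney formulation, with ties counted as half). A minimal Python sketch, assuming binary 0/1 labels:

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC: fraction of (positive, negative) pairs where
    the positive case scores higher; ties contribute half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This O(n_pos x n_neg) version is only for illustration; in practice `yardstick::roc_auc()` (R) or `sklearn.metrics.roc_auc_score` (Python) compute the same quantity efficiently.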

d)

Choice of Best Model

Overall, six different models were analysed and investigated: Logistic Regression, LDA, "bad" decision tree, tuned decision tree, random forest and boosted tree. Sorted from highest AUC to lowest, they rank: boosted tree, random forest, tuned decision tree, LDA, Logistic Regression, "bad" decision tree. The LDA and Logistic Regression models are close to one another, while the tuned decision tree and the random forest are also reasonably close. It makes sense that the boosted tree model achieved a higher AUC, as each tree learns from and improves on the mistakes of the previous ones, unlike a random forest, which builds an ensemble of independent trees. Hence, judging by the ROC Curve alone, I would have to say that the best model choice is the boosted tree model - with the overall largest AUC of 97.1%.

However, if we wanted to consider other factors such as model complexity and training speed, I would argue that the tuned decision tree model is the most appropriate. The tuned decision tree, although achieving only the third-highest AUC, trains very quickly while still maintaining a high AUC of 94.6%. Both the random forest and the boosted tree models took much longer to train, and a trade-off of 2.5% of AUC relative to the boosted tree model, or 1.9% relative to the random forest model, may be worth it for the speed.

Therefore, if I wanted a quick and relatively simple model to train that still has high accuracy, I would choose the tuned decision tree model. But if time did not matter and I simply wanted the most accurate model, I would choose the boosted tree model, even considering tuning further hyperparameters (which would drastically increase training time even more but may yield a slight gain in accuracy).

How the Time Series Typically Differ

Overall, it was found that the time series for financial data and birdsongs typically differed on two main variables. From the “bad” decision tree, tuned decision tree, random forest and boosted tree models, linearity and covariate2 were the two most significant variables in helping distinguish between birdsongs and financial time series. Even in the LDA and Logistic Regression models, these two variables had some significance. Additionally, it was learnt that covariate1 is not useful in distinguishing between the two types of data, indicating that it could probably be neglected in future.

Overall, no matter how good the models are, there are still observations that they cannot correctly distinguish - due to their similarity - which would make them hard even for a human to classify. We also saw that the models are more likely to misclassify finance as birdsongs than vice versa, which is important to keep in mind when drawing conclusions from any of these models.

References

Arnold, J. B. (2012). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 5.1.0. Available at: https://CRAN.R-project.org/package=ggthemes

Canty, A., & Ripley, B. D. (2021). boot: Bootstrap R (S-Plus) Functions. Available at: https://CRAN.R-project.org/package=boot

Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., … & Yuan, J. (2025). xgboost: Extreme Gradient Boosting. R package version 3.0.0.1. Available at: https://github.com/dmlc/xgboost

Cheng, J., Xie, Y., Wickham, H., Chang, W., & McPherson, J. (2023). crosstalk: Inter-Widget Interactivity for HTML Widgets. Available at: https://CRAN.R-project.org/package=crosstalk

Garnier, S., Ross, N., Rudis, B., Sciaini, M., Camargo, A. P., & Scherer, C. (2023). viridisLite: Colorblind-Friendly Color Maps (Lite Version). R package version 0.4.2. Available at: https://CRAN.R-project.org/package=viridisLite

Hart, C., & Wang, E. (2022). detourr: Portable and Performant Tour Animations. Available at: https://CRAN.R-project.org/package=detourr

Hvitfeldt, E., Silge, J., Kuhn, M., & Vaughan, D. (2023). discrim: Model Wrappers for Discriminant Analysis. Available at: https://CRAN.R-project.org/package=discrim

Kassambara, A. (2023). ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version 0.6.0. Available at: https://rpkgs.datanovia.com/ggpubr/

Kuhn, M., Wickham, H., & Weston, S. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. Available at: https://www.tidymodels.org

Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3), 18–22. Available at: https://CRAN.R-project.org/package=randomForest

Milborrow, S. (2024). rpart.plot: Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. R package version 3.1.2. Available at: https://CRAN.R-project.org/package=rpart.plot

Pedersen, T. L. (2025). patchwork: The Composer of Plots. R package version 1.3.0.9000. Available at: https://patchwork.data-imaginist.com/

Schloerke, B., Cook, D., Larmarange, J., Briatte, F., Marbach, M., Thoen, E., Elberg, A., & Crowley, J. (2024). GGally: Extension to ‘ggplot2’. R package version 2.2.1. Available at: https://CRAN.R-project.org/package=GGally

Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. Available at: https://plotly-r.com

Wickham, H., Cook, D., Hofmann, H., & Buja, A. (2011). tourr: An R Package for Exploring Multivariate Data with Projections. Journal of Statistical Software, 40(2), 1–18. Available at: http://www.jstatsoft.org/v40/i02/

Wickham, H., François, R., Henry, L., & Müller, K. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. DOI: https://doi.org/10.21105/joss.01686

Wickham, H., Hester, J., & Bryan, J. (2024). readr: Read Rectangular Text Data. R package version 2.1.5. Available at: https://readr.tidyverse.org

Xie, Y. (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.50. Available at: https://yihui.org/knitr/

Zhu, H. (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. Available at: https://CRAN.R-project.org/package=kableExtra